Statistical tests can help us to obtain statistical rigor in comparing and evaluating data. Here I'm running a few tests with α level = 0.05 on ridership data of raini and non-raini days, to validate the distribution and statistical significance.
In [24]:
import numpy as np
import pandas as pd
import scipy.stats as st
df = pd.read_csv('turnstile_weather_v2.csv', parse_dates=['datetime'])
ent_hr_rain = df[df["rain"] == 1]["ENTRIESn_hourly"]
ent_hr_no_rain = df[df["rain"] == 0]["ENTRIESn_hourly"]
From the histograms in section 0 we already knew that the distribution is right-skewed. To obtain statistic rigor, I'm double checking again. But because Shapiro–Wilk test may not be accurate for N > 5000, here I'm testing the normality with scipy.stats.normaltest which combines skew and kurtosis normality tests. Its p-value is a 2-tail chi squared probability for the hypothesis test. From the result p-values we can reject the null hypothesis that the data was drawn from a normal distribution for both rainy and non-rainy days. Both are non-normal.
In [25]:
# w, p = st.shapiro(df[(df['rain'] == 0) & (df['weekday'] == 1)]['ENTRIESn_hourly'])
# Got warning: p-value may not be accurate for N > 5000.
k2, p = st.normaltest(ent_hr_rain)
print 'RAIN: k2 = {0:.5f}, p-value = {1:.5f}'.format(k2, p)
k2, p = st.normaltest(ent_hr_no_rain)
print 'NO RAIN: k2 = {0:.5f}, p-value = {1:.5f}'.format(k2, p)
In section 0 we already got the stats of hourly entries (every 4 hours) for rainy and non-rainy days:
Hourly entries - rain: Count = 9,585, Total = 19,440,259, Median = 939, Mean = 2,028.20
Hourly entries - no rain: Count = 33,064, Total = 61,020,916, Median = 893, Mean = 1,845.54
Median and mean are both a bit higher for rainy days. But are they statistically higher than non-rainy days? Though when we perform Welch's t-test we usually assume the data is normal, with large enough sample size, we can still run Welch's t-test to see if 2 data sets have the same sample means (μ1 = μ2). Welch's t-test doesn't assume equal sample size nor equal variance. Below the scipy.stats.ttest_ind function returns a 2-tail p-value, which is much smaller than p-critical 0.025. We can reject the ull hypothesis that 2 independent samples have identical average (expected) values. Here t-statistic is greater than 0 and 1-tail p-value/2 is much less than p-critical 0.025. We can confirm that the mean of entries is significantly higher when it rained (μ1 > μ2). This is interesting. From the bar chart in section 0 these 2 data sets don't look too different but they are statistically different.
In [26]:
t, p = st.ttest_ind(ent_hr_rain, ent_hr_no_rain, equal_var = False)
print '2-tail RAIN vs. NO RAIN: t = {0:,.5f}, p-value = {1:.10f}'.format(t, p)
print '1-tail RAIN vs. NO RAIN: p-value/2 = {0:.10f}'.format(p/2)
One way to check how much the factor could affect the results is by calculating a R^2 = t^2 / (t^2 + Degree of Freedom). And the result shows that only about 0.06% of the differences in ridership was due to rain. Though the ridership was statistically higher for rainy days, some other factors are playing important roles too.
In [27]:
r_squared = t*t / (t*t + (9585 + 33064 - 2))
print 'R^2 = t^2 / (t^2 + Degree of Freedom) = {}'.format(r_squared)
Since the data is non-normal, we can try Mann–Whitney U test which is a nonparametric test of the null hypothesis that two populations are the same. Mann–Whitney U test doesn't assume any underlying probability distribution. Below I'm running scipy.stats.mannwhitneyu function to see if these 2 data sets are from the same population. This function returns a 1-tail p-value.
In [28]:
u, p = st.mannwhitneyu(ent_hr_rain, ent_hr_no_rain)
print '1-tail RAIN vs. NO RAIN: u = {0:,.2f}, p-value = {1:.10f}'.format(u, p)
However, I can't get p-value directly using this function. So below I'm calculating p-value maually, following the steps for normal approximation provided in http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Normal_approximation.
In [29]:
m_u = len(ent_hr_rain) * len(ent_hr_no_rain) / 2
print "m_u = {0:,.1f}".format(m_u)
sigma_u = np.sqrt(len(ent_hr_rain) * len(ent_hr_no_rain) * (len(ent_hr_rain) + len(ent_hr_no_rain) + 1) / 12)
print "sigma_u = {0:,.5f}".format(sigma_u)
z = (u - m_u) / sigma_u
print "z = {0:,.5f}".format(z)
p = 2 * st.norm.cdf(z)
print '2-tail RAIN vs. NO RAIN: u = {0:,.2f}, 2*p-value = {1:.10f}'.format(u, p)